DAX Eigencharts
Research question
When I was 12 years old, I saw an intraday chart of the German stock index (DAX) for the first time. It was in the local newspaper and came with lots of numbers and figures for different stocks, currencies, commodities etc. And all these numbers and especially the DAX intraday chart fascinated me at once: It was love on the first sight!
Since then I have seen many hundrets of DAX charts. And at some point I had the strange feeling, that there is only a limited number of different DAX charts. And if that assumption holds true, one could take advantage of that. My idea was was the following: If I only saw the first half of the DAX intraday chart, I could try to find similar charts and predict a trend for the second half of the day and bet either on stocks going further up or down.
To do this, I need to answer three questions:
- Are there characteristic Eigencharts that explain the intraday DAX movement?
- If yes, how many of these Eigencharts do I need to capture a reasonable amount of the total variance in DAX movements?
- If dimensionality reduction does not work efficiently, can we at least cluster DAX movements?
My approach
Data
There are plenty of possibilities to get intraday charts for all kinds of stocks, indices, currencies etc. from the internet. In this project we work with data from BacktestMarket. They provide a history of DAX charts on a daily basis back to 2000. It comes as so called tick data: You get DAX values for each minute: Open, high, low, close and volume.
The complete data set is too large to put it on GitHub. Therefore we need to focus on a one year history. I picked 2018 data.
Methods
As outlined in the research question, we want to look for Eigencharts. This can best be done with principal component analysis (PCA). Cumulating the obtained eigenvalues gives us a measure of the partial amount of variance captured. This answers the first two questions.
And in the end we want to do some kmeans clustering to answer the third question.
Analysis Part 1 (PCA)
We need to include the following libraries:
library(tidyverse) # data wrangling and ggplot
library(chron) # converting time
library(factoextra) # clustering
library(lubridate) # converting date
library(pracma) # dot-product
library(plotly) # plot with metadata
library(gridExtra) # arrange plots on gridThe following code chunk can only be run with the original data set. As I mentioned above, the complete data set is too large to put it on GitHub. In case you want to work with the whole data, feel free to ask me.
## read in complete data set
# dax_history <- read.csv("dax_history.csv", header = FALSE, sep = ";", col.names = c("date", "time", "open", "high", "low", "close", "volume"))
## keep only the columns we need and change date and time format
# dax_history <- dax_history %>% select(c("date", "time", "close"))
# dax_history$date <- as.Date(dax_history$date, "%d/%m/%Y")
# dax_history$time <- chron(times = dax_history$time)
## filter for 2018 and keep only the core time frame
## note: there is a 7 hour time shift, since it's NY time zone
# dax_2018 <- dax_history[year(dax_history$date) == 2018,]
# dax_2018 <- dax_2018[dax_2018$time >= "01:00:00",]
# dax_2018 <- dax_2018[dax_2018$time <= "14:59:00",]
## to get project on gitHub, need to work with this smaller data set
# write.csv(dax_2018, "dax_2018.csv", row.names = FALSE)Preprocessing Data
We are now ready to further preprocess the data.
- In this project we focus on intraday data for DAX movements on a minute basis for the year 2018. We picked the closing value, which gives us the DAX value for the end of each minute (dax_2018).
- We want to obtain a separate time series for each date, thus pivot the minutes to be columns and put in the closing value for each minute (dax_2018_piv).
- This leaves us with 251 time series, each containing 840 minutes from 01:00:00 to 14:59:00. Remember the time shift of 7 hours, since data stems from US (NY time zone).
rm(list=ls()) # clean environment
dax_2018 <- read.csv("dax_2018.csv")
dax_2018_piv <- dax_2018 %>% pivot_wider(names_from = time, values_from = close)- We need to look for missing values and filter only for the 165 complete time series (dax_2018_piv_clean).
miss_val <- 0 # initialize
for(i in 1:length(dax_2018_piv$date)){miss_val[i] <- (sum(is.na(dax_2018_piv[i,])))}
dax_2018_piv_clean <- dax_2018_piv[miss_val==0,]- We then need to sort columns to have an ascending order of minutes (dax_series).
Example DAX Chart
- Let’s plot an example DAX chart to get a feeling for our time series.
- For ggplot we need to convert the transposed time series to a data.frame (t_vec_df).
minute <- c(1:840) # for the x-axis
series_number <- 1 # just pick the first one
vec <- dax_series[series_number,2:841]
t_vec <- t(vec)
t_vec_df <- as.data.frame(t_vec)
p_dax_example <- ggplot(t_vec_df, mapping = aes(x=minute, y=t_vec_df$V1)) +
geom_line() +
xlab("minute") +
ylab("DAX")
p_dax_exampleDifference Matrix
- A DAX chart can be fully explained by the starting value plus the absolute difference from minute to minute. So let’s calculate the matrix of absolute differences for each time series (dax_series_diff).
- To efficiently calculate the minutely differences we simply shift the dax series by one minute (dax_series_sub1 and dax_series_sub2) and substract the shifted series.
- Note: The first column gives us the date.
Example minutely Changes
- Let’s plot an example series of the differences from minute to minute for one DAX intraday chart time series to see what we are working with.
- For ggplot we need to convert the transposed time series to a data.frame (t_vec_df).
series_number <- 1 # just pick the first one
vec <- dax_series_diff[series_number,2:841]
t_vec <- t(vec)
t_vec_df <- as.data.frame(t_vec)
ggplot(t_vec_df, mapping = aes(x=minute, y=t_vec_df$V1)) +
geom_line() +
xlab("minute") +
ylab("DAX change")Average DAX Chart
- As prerequisite for PCA we now calculate the series of average absolute minutely changes over all 165 dax series (avg_diff).
- We then restore the corresponding chart from the series of average differences and plot the chart.
avg_diff <- colMeans(dax_series_diff[,2:841])
avg_acc <- avg_diff # initialize
for (i in 2:length(t(avg_acc))){
avg_acc[i] <- avg_acc[i-1] + avg_acc[i]
}
# plot DAX restored from the average change per minute
vec <- as.data.frame(avg_acc)
ggplot(vec, mapping = aes(x=minute, y=vec$avg_acc)) +
geom_line() +
xlab("minute") +
ylab("average DAX chart")PCA
- To prepare the input data for our PCA analysis we need to center each time series of minutely differences by substracting the average difference from every time series (dax_series_diff_center).
- Scaling is not necessary since we explicitely want to capture large volatility in intraday movements as a feature.
avg_diff_sub <- avg_diff[2:840] # omit column 01:00:00 (only filled with zeros)
dax_series_diff_center <- dax_series_diff[,3:841] # initialize: omit date column and first column (only filled with zeros)
for (i in 1:165){
dax_series_diff_center[i,] <- dax_series_diff_center[i,] - avg_diff_sub
}Remember: We look for Eigencharts. But before we eventually discover our Eigencharts, we perform PCA on the centered time series of absolute minutely changes (dax_series_diff_center). Some remarks on dimensionality are helpfull:
- We startet with 165 DAX time series, each of length 840 (dax_series).
- We then came to 165 series of minutely changes of length 839, since first column is zero.
- The average minutely change was substracted and we obtained (dax_series_diff_center)
- PCA is done by looking for eigenvalues and eigenvectors of the covariance matrix (dax_cov).
- With an input of 165 time series, we can obtain a maximum of 165 non-zero eigenvalues.
- In our case we find 164 non-zero eigenvalues.
- There are 839 eigenvectors, but with only 164 non-zero eigenvalues we only need to take the first 164 eigenvectors into account.
dax_cov <- cov(dax_series_diff_center)
e <- eigen(dax_cov)
eigenvalues <- e$values
eigenvectors <- e$vector- To make sure we found eigenvalues and eigenvectors, we shortly proof some characteristics.
# proof that evec1 and eval1 fit together
evec1 <- eigenvectors[,1]
eval1 <- eigenvalues[1]
(dax_cov %*% evec1) - (eval1 * evec1) # correct, since obtain vector of zeros
evec2 <- eigenvectors[,2]
evec3 <- eigenvectors[,3]
evec4 <- eigenvectors[,4]
evec5 <- eigenvectors[,5]
dot(evec1,evec2) # correct, orthogonal
dot(evec1,evec1) # correct, length = 1- The so called PC scores give us the projection of the original data onto the eigenvectors.
- However - I had some trouble to get the PC scores. Therefore I needed to convert dax_series_diff_center first to data.frame (dax_series_diff_center_df) and later to matrix - by hand (dax_series_diff_center_matrix).
- Otherwise R doesn’t let us compute the PC scores…
dax_series_diff_center_df <- as.data.frame(dax_series_diff_center)
dax_series_diff_center_matrix <- matrix(rep(0,165*839),nrow=165) # initialize
for (i in 1:839){
dax_series_diff_center_matrix[,i] <- dax_series_diff_center_df[,i]
}
pc_scores <- dax_series_diff_center_matrix %*% eigenvectors- To recover the original matrix of centered minutely changes time series, one needs to multiply the PC scores (pc_scores) with the inverse eigenvector-matrix.
- The inverse in this case is just the transpose.
temp <- pc_scores %*% t(eigenvectors)# indeed looks identical to dax_series_diff_center_matrix, as expected
# view(temp)
# view(dax_series_diff_center_matrix)- Before we finally look at the Eigencharts, we first want to answer the second question: How many Eigencharts do I need to capture a reasonable amount of the total variance in DAX movements?
- By reasonable we mean more than 75% in our case.
- To answer the question, we need to accumulate the eigenvalues and divide by the sum over all eigenvalues.
- Plotting the accumulated variance over the number of necessary eigenvalues confirms that we need at least the first 75 eigenvectors to capture more than 75% of the total variance.
var_eigenvalues_acc <- 0 # initialize
for (i in 1:164){
var_eigenvalues_acc[i] <- sum(eigenvalues[1:i]) / sum(eigenvalues)
}
var_eigenvalues_acc_df <- data.frame(var_eigenvalues_acc)
number_eigenvalue <- c(1:164) # for x-axis
p_sum_eval <- ggplot(var_eigenvalues_acc_df, mapping = aes(x=number_eigenvalue, y=var_eigenvalues_acc_df$var_eigenvalues_acc)) +
geom_point() +
xlab("Number of Eigenvalues") +
ylab("Accumulated Variance captured")
ggplotly(p_sum_eval)Recover DAX series
Let’s have a look, how we can recover a single DAX time series from an increasing number of eigenvectors. To obtain the original data we need to multiply PC scores with the transpose eigenvector matrix - as mentioned earlier. But we can also restrict ourselves to the first n eigenvectors and the first n PC scores to produce an approximation. The more eigenvectors we include, the better the approximation. As seen, recovering a chart from the first 75 eigenvectors captures 75% of the total variance and thus should give us an acceptable approximation of the original DAX chart.
- We have already seen a plot of the first intraday DAX series above.
- Now we try to reproduce the first DAX intraday chart using an increasing number of eigenvectors.
- Using only the first 10 eigenvectors does not fit the original series at all.
- Taking the first 75 eigenvectors into account, we can recognize the original series in its shape, but the overall level is still too low.
- With the first 100 eigenvectors we get resonably close to the original chart.
- And from all 165 eigenvectors we end up with the exact original data.
- Since we applied PCA to centered minutely changes, we need to de-mean the recovered time series of minutely changes by adding the average change from above (avg_diff).
- And to restore the DAX chart, minutely changes need to be summed up and the starting value needs to be added.
# recover 1st DAX series from 165 eigenvectors
recov_165 <- pc_scores[,1:165] %*% t(eigenvectors)[1:165,]
recov_165_s1_diff <- recov_165[1,]
recov_165_s1_diff_demean <- avg_diff # initialize
recov_165_s1_diff_demean[2:840] <- recov_165_s1_diff_demean[2:840] + recov_165_s1_diff
recov_165_s1_diff_demean_acc <- rep(0,840) # initialize
recov_165_s1_diff_demean_acc[1] <- as.double(dax_series[1,2]) # set starting value
for (i in 2:840){
recov_165_s1_diff_demean_acc[i] <- recov_165_s1_diff_demean_acc[i-1] +
recov_165_s1_diff_demean[i]
}
vec_165 <- as.data.frame(recov_165_s1_diff_demean_acc)
p_recov_165 <- ggplot(vec, mapping = aes(x=minute, y=vec_165$recov_165_s1_diff_demean_acc)) +
geom_line() +
xlab("minute") +
ylab("165")
# recover 1st DAX series from 100 eigenvectors
recov_100 <- pc_scores[,1:100] %*% t(eigenvectors)[1:100,]
recov_100_s1_diff <- recov_100[1,]
recov_100_s1_diff_demean <- avg_diff # initialize
recov_100_s1_diff_demean[2:840] <- recov_100_s1_diff_demean[2:840] + recov_100_s1_diff
recov_100_s1_diff_demean_acc <- rep(0,840) # initialize
recov_100_s1_diff_demean_acc[1] <- as.double(dax_series[1,2]) # set starting value
for (i in 2:840){
recov_100_s1_diff_demean_acc[i] <- recov_100_s1_diff_demean_acc[i-1] +
recov_100_s1_diff_demean[i]
}
vec_100 <- as.data.frame(recov_100_s1_diff_demean_acc)
p_recov_100 <- ggplot(vec, mapping = aes(x=minute, y=vec_100$recov_100_s1_diff_demean_acc)) +
geom_line() +
xlab("minute") +
ylab("100")
# recover 1st DAX series from 75 eigenvectors
recov_75 <- pc_scores[,1:75] %*% t(eigenvectors)[1:75,]
recov_75_s1_diff <- recov_75[1,]
recov_75_s1_diff_demean <- avg_diff # initialize
recov_75_s1_diff_demean[2:840] <- recov_75_s1_diff_demean[2:840] + recov_75_s1_diff
recov_75_s1_diff_demean_acc <- rep(0,840) # initialize
recov_75_s1_diff_demean_acc[1] <- as.double(dax_series[1,2]) # set starting value
for (i in 2:840){
recov_75_s1_diff_demean_acc[i] <- recov_75_s1_diff_demean_acc[i-1] +
recov_75_s1_diff_demean[i]
}
vec_75 <- as.data.frame(recov_75_s1_diff_demean_acc)
p_recov_75 <- ggplot(vec, mapping = aes(x=minute, y=vec_75$recov_75_s1_diff_demean_acc)) +
geom_line() +
xlab("minute") +
ylab("75")
# recover 1st DAX series from 10 eigenvectors
recov_10 <- pc_scores[,1:10] %*% t(eigenvectors)[1:10,]
recov_10_s1_diff <- recov_10[1,]
recov_10_s1_diff_demean <- avg_diff # initialize
recov_10_s1_diff_demean[2:840] <- recov_10_s1_diff_demean[2:840] + recov_10_s1_diff
recov_10_s1_diff_demean_acc <- rep(0,840) # initialize
recov_10_s1_diff_demean_acc[1] <- as.double(dax_series[1,2]) # set starting value
for (i in 2:840){
recov_10_s1_diff_demean_acc[i] <- recov_10_s1_diff_demean_acc[i-1] +
recov_10_s1_diff_demean[i]
}
vec_10 <- as.data.frame(recov_10_s1_diff_demean_acc)
p_recov_10 <- ggplot(vec, mapping = aes(x=minute, y=vec_10$recov_10_s1_diff_demean_acc)) +
geom_line() +
xlab("minute") +
ylab("10")
grid.arrange(p_recov_10, p_recov_75, p_recov_100, p_recov_165, nrow = 4)Or maybe it can better be visualized in one plot:
colnames(vec_10) <- "index"
vec_10 <- vec_10 %>% mutate("num_evecs"=10, "minute"=minute)
colnames(vec_75) <- "index"
vec_75 <- vec_75 %>% mutate("num_evecs"=75, "minute"=minute)
colnames(vec_100) <- "index"
vec_100 <- vec_100 %>% mutate("num_evecs"=100, "minute"=minute)
colnames(vec_165) <- "index"
vec_165 <- vec_165 %>% mutate("num_evecs"=165, "minute"=minute)
vec_all <- rbind(vec_10, vec_75, vec_100, vec_165)
vec_all$num_evecs <- as.factor(vec_all$num_evecs)
ggplot(data = vec_all, aes(x=minute, y=index)) +
geom_line(aes(colour=num_evecs)) +
scale_color_manual(breaks = c("10", "75", "100", "165"),
values=c("red", "blue", "green", "black")) +
ggtitle("Recover DAX chart from various number of eigenvectors")PCA Conclusion
We started with 165 time series of intraday DAX movements. As we have seen, we need at least 75 eigenvectors to capture a reasonable amount - in this case 75% - of the total variance. Before the analysis I hoped to end up with - let’s say - ten Eigencharts to explain the world of intraday DAX charts and to predict movements for new charts that are not part of the test set. This doesn’t work, dimensionality reduction is not as efficient as I expected.
Example Eigencharts
- Nevertheless, we want to look at the first eigenvectors and create the according eigencharts, to get a better feeling on how an eigenchart looks.
- The eigenvectors represent minutely changes. The according eigenchart cumulates the minutely changes.
- We then plot the first 4 eigencharts as an example.
eigencharts <- matrix(rep(0, 840 * 839), nrow = 840) # initialize
for (i in 1:839){
eigencharts[i+1,] <- eigencharts[i,] + eigenvectors[i,]
}
eigencharts <- as.data.frame(eigencharts)
p_eigencharts_1 <- ggplot(eigencharts, mapping = aes(x=minute, y=eigencharts[,1])) +
geom_line() +
xlab("minute") +
ylab("index") +
ggtitle(paste("Eigenchart number:",1))
p_eigencharts_2 <- ggplot(eigencharts, mapping = aes(x=minute, y=eigencharts[,2])) +
geom_line() +
xlab("minute") +
ylab("index") +
ggtitle(paste("Eigenchart number:",2))
p_eigencharts_3 <- ggplot(eigencharts, mapping = aes(x=minute, y=eigencharts[,3])) +
geom_line() +
xlab("minute") +
ylab("index") +
ggtitle(paste("Eigenchart number:",3))
p_eigencharts_4 <- ggplot(eigencharts, mapping = aes(x=minute, y=eigencharts[,4])) +
geom_line() +
xlab("minute") +
ylab("index") +
ggtitle(paste("Eigenchart number:",4))
grid.arrange(p_eigencharts_1, p_eigencharts_2, p_eigencharts_3, p_eigencharts_4, nrow=2)Analysis Part 2 (kmeans)
The first part of the story ends here. Explaining the universe of intraday DAX charts by only a few eigencharts failed. But are there similarities in all these DAX charts? Can they be clustered? But how do we want to define similarity of two DAX charts?
One way to start with is the following: We split each daily DAX chart in two parts: The first 7 hours of trading (a.m.) and the second 7 hours (p.m.).
Preprocessing the data
- Let’s start over with all 165 intraday DAX charts from the first part of this project (dax_series).
- We create a new table (dax_movement) and keep only the date column and calculate the DAX change in the first half (01:00:00 - 08:00:00) and the second half (08:00:00 - 14:59:00) of the trading day.
- Call the first half am_movement and the second half pm_movement.
# put together the dax change of the 1st half and the 2nd half day, plus date
dax_movement <- cbind(dax_series["date"],
dax_series["08:00:00"] - dax_series["01:00:00"],
dax_series["14:59:00"] - dax_series["08:00:00"])
colnames(dax_movement) <- c("date", "am_movement", "pm_movement")- Visualizing the data can best be done with a scatter plot with am_movement on the x-axis and pm_movement on the y-axis.
ggplot(dax_movement, mapping = aes(x=am_movement, y=pm_movement)) +
geom_point() +
ggtitle("DAX intraday movement a.m. vs. p.m.")kmeans
Let’s look for homogeneous subgroups in dax movements by performing kmeans clustering. kmeans clustering is a method for partitioning the observations into k groups. The number of clusters k needs to be pre-specified. The kmeans algorithm produces a large number of return values, notably:
- cluster: the cluster to which each chart is allocated.
- centers: a matrix of cluster centers.
- withinss: within-cluster sum of squares (= the sum of the squared Euclidean distances to the cluster center)
- tot.withinss: Total within-cluster sum of squares (= the sum of the withinss).
- size: The number of charts in each cluster.
We will use the Lloyd algorithm as discussed in class. There are other algorithms and the major difference between the algorithms is how they do the initial assignment of data points to clusters: The Lloyd algorithm therefore selects random DAX movements and defines them as the initial cluster centers. All other DAX movements are then assigned to these cluster centers such that the within-cluster variance is minimized. Then the cluster center is re-calculated. This represents one iteration of the Lloyd algorithm.
We increase the maximum number of iterations iter.max from default (= 10 iterations) to 30, to ensure the algorithm to converge. And we want to repeat the cluster assignment 20 times by setting nstart. This avoids ending up in a local minimum of total within-cluster sum of squares.
# kmeans using Lloyd-algorithm: Looking for 3 clusters in this example
km <- kmeans(dax_movement[,2:3], centers = 3, algorithm = "Lloyd", iter.max = 30, nstart = 20)
# make sure to get close to the absolute minimum of total within-cluster sum of squares
i = 1
for (i in 1:20){
km <- kmeans(dax_movement[,2:3], centers = 3, algorithm="Lloyd", iter.max = 30, nstart = 1)
print(km$tot.withinss)
i = i+1
}## [1] 1202976
## [1] 1202662
## [1] 1202587
## [1] 1202662
## [1] 1202662
## [1] 1202662
## [1] 1202587
## [1] 1473467
## [1] 1202662
## [1] 1202587
## [1] 1202662
## [1] 1203043
## [1] 1202662
## [1] 1207846
## [1] 1209176
## [1] 1202587
## [1] 1203963
## [1] 1202662
## [1] 1204072
## [1] 1208990
To find the optimal number k of clusters, we use the function wss_fct from class and apply the ellbow-method.
# Function that runs kmeans clustering and extracts the total within-cluster sum of square
wss_fct <- function(k) {
km <- kmeans(dax_movement[,2:3], centers=k, algorithm="Lloyd", iter.max = 30, nstart = 20)
print(km$tot.withinss)
}
# Define a vector for k
num_centers <- 1:15
# extract wss for 1 to 15 clusters and put LABELS to x- and y-axes
wss <- sapply(num_centers, FUN=wss_fct)## [1] 2569798
## [1] 1827691
## [1] 1202662
## [1] 941697.4
## [1] 754808
## [1] 634031.5
## [1] 533861.3
## [1] 480089
## [1] 416467.8
## [1] 374374.4
## [1] 337051.4
## [1] 320163.9
## [1] 288251.3
## [1] 260541.5
## [1] 246161.3
data.frame(num_centers, wss) %>%
ggplot(aes(x = num_centers, y = wss)) +
geom_point(size = 3) + geom_line() +
xlab("Number of clusters") +
ylab("Total within-clusters sum of squares") +
ggtitle("Finding optimal number of clusters")Having three or four clusters seem to be the optimum in this case. We continue with three clusters here, add the cluster assignment to the data (dax_movement_cluster) and plot the clustering results.
km <- kmeans(dax_movement[,2:3], centers = 3, algorithm = "Lloyd", iter.max = 30, nstart = 20)
# add cluster-assignment to the data
dax_movement_cluster <- bind_cols(dax_movement, cluster = as.factor(km$cluster))
# plot
p <- dax_movement_cluster %>%
ggplot() +
geom_point(aes(am_movement, pm_movement, color = cluster, text = date)) +
geom_point(data = as.data.frame(km$centers), aes(am_movement, pm_movement), size = 3) +
ggtitle("Cluster assignment of DAX movements")
ggplotly(p)kmeans Conclusion
We have divided the daily DAX movements in two halfs: a.m. and p.m. referring to the first and last 7 hours of a trading day, respectively. This can be used for applying simple trading strategies. For instance by knowing how the DAX has performed in the first half, one could bet on the performance in the second half. This can be done with so called KnockOut certificates.
However, the results we obtained in our clustering analysis are not useful for several reasons:
- Although we find three clusters to be an optimum, it looks more like ONE big diffuse cluster, since clusters are not seperated here.
- There is a big range in both, am_movements and pm_movements and movements seam to - more or less - average out.
- We do not see a clear tendency for the DAX to move into any direction, it seems to be randomly distributed.
## [1] -68.22277
## [1] -5.584158
## [1] 66.91406
## [1] -6.148438
Outlook
With the methods we learned in class, I am finally able to analyse the bahaviour of daily stock market indices. My ideas to find a small number of Eigencharts to explain the universe of DAX movements as well as clustering DAX performance by a.m. and p.m movements have failed. But I have some more ideas in my head. And: If it was easy to predict stock markets, many people would make huge profits. That would be boring… ;-)